While statistical physics methods applied to large corpora have unveiled many 
patterns in texts, no comprehensive investigation has examined the properties of 
statistical measurements across different languages and texts. In this study we propose a framework 
that aims at determining if a text is compatible with a natural language and which languages are 
closest to it, without any knowledge of the meaning of the words. The approach is based on three 
types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, 
from the topology of complex networks representing text, and from intermittency concepts where 
text is treated as a time series. Comparative experiments were performed with the New Testament 
in 15 different languages and with distinct books in English and Portuguese in order to quantify 
the dependency of the different measurements on the language and on the story being told in the 
book. The metrics found to be informative in distinguishing real texts from their shuffled versions 
include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered 
medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with 
natural languages and incompatible with random texts. We also obtain candidates for keywords of 
the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able 
to identify statistical measurements that are more dependent on the syntax than on the semantics, 
the framework may also serve for text analysis in language-dependent applications. 
I. INTRODUCTION 

Methods from statistics, statistical physics, and artificial
intelligence have increasingly been used to analyze large
volumes of text for a variety of applications, some of which
are related to fundamental linguistic and cultural phenomena.
Examples of studies on human behaviour are the analysis of
mood change in social networks [1] and the identification of
literary movements [3]. Other applications of statistical
natural language processing include the development of
techniques to improve the performance of information
retrieval systems, search engines, machine translators and
automatic summarizers [12]. Evidence of the success of
statistical techniques for natural language processing is the
superiority of current corpus-based machine translation
systems over their counterparts based on the symbolic
approach [10].

The methods for text analysis we consider can be clas- 
sified into three broad classes: (i) those based on first- 
order statistics where data on classes of words are used in 
the analysis, e.g. frequency of words; (ii) those based
on metrics from networks representing text [10]; and
(iii) those using intermittency concepts and time-series
analysis for texts [5]. One of the major advantages
inherent in these methods is that no knowledge about the
meaning of the words or the syntax of the languages is 
required. Furthermore, large corpora can be processed 
at once, thus allowing one to unveil hidden text prop- 
erties that would not be probed in a manual analysis 
given the limited processing capacity of humans. The 
obvious disadvantages are related to the superficial na- 
ture of the analysis, for even simple linguistic phenomena 
such as lexical disambiguation of homonymous words are 
very hard to treat. Another limitation in these statistical 
methods is the need to identify the representative fea- 
tures for the phenomena under investigation, since many 
parameters can be extracted from the analysis but there 
is no rule to determine which are really informative for 
the task at hand. Most significantly, in a statistical anal- 
ysis one may not even be sure if the sequence of words in 
the dataset represents a meaningful text at all. For test- 
ing whether an unknown text is compatible with natural 
language, one may calculate measurements for this text 
and several others of a known language, and then verify 
if the results are statistically compatible. However, there 
may be variability among texts of the same language, 
especially owing to semantic issues. 

In this study we combine measurements from the three 
classes above and propose a framework to determine the 
importance of these measurements in investigations of 
unknown texts, regardless of the alphabet in which the 
text is encoded. The statistical properties of words and 
the books were obtained for comparative studies involv- 
ing the same book (New Testament) in 15 languages and 
distinct pieces of text written in English and Portuguese. 
The purpose in this type of comparison was to iden- 
tify the features capable of distinguishing a meaningful 
text from its shuffled version (where the position of the 
words is randomized), and then determine the proximity 
of pieces of text. 

As an application of the framework, we analyzed the 
famous Voynich Manuscript (VMS), which has remained 
indecipherable in spite of attempts from renowned cryp- 
tographers for a century. This manuscript dates back 
to the 15th century, possibly produced in Italy, and was 
named after Wilfrid Voynich who bought it in 1912. In 
the analysis we make no attempt to decipher VMS, but 
we have been able to verify that it is compatible with nat- 
ural languages, and even identified important keywords, 
which may provide a useful starting point toward deci- 
phering it. 



II. DESCRIPTION OF THE MEASUREMENTS

The analysis involves a set of steps going beyond the basic
calculation of measurements, as illustrated in the workflow
in Fig. 1. Some measurements are averaged in order to obtain
a measurement on the text level from the measurement on the
word level. In addition, a comparison with values obtained
after randomly shuffling the text is performed to assess to
what extent structure is reflected in the measurements.

A. Raw measurements

1. First-order statistics



The simplest measurements obtained are the vocabulary size
$M$, which is the number of distinct words in the text, and
the frequency of word $i$ (number of appearances), denoted by
$N_i$. The heterogeneity of the contexts surrounding words was
quantified with the so-called selectivity measurement [16].
If a word is strongly selective then it always co-occurs with
the same adjacent words. Mathematically, the selectivity of a
word $i$ is $s_i = 2N_i/t_i$, where $t_i$ is the number of
distinct words that appear immediately beside (i.e., before
or after) $i$ in the text.

A language-dependent feature is the number of different words
(types) that at least once had two word tokens immediately
beside each other in the text, i.e. words that are repeated
consecutively. In some languages this repetition is rather
unusual, but in others it may occur with a reasonable
frequency (see Sec. III and Fig. 3). In this paper, the number
of repeated bigrams is denoted by $B$.
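
The first-order quantities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is taken to be already tokenized into a list of words, the function name `first_order_stats` is hypothetical, and a word that is repeated consecutively is counted as its own neighbour when computing $t_i$ (the text does not say how this case is treated).

```python
from collections import Counter, defaultdict

def first_order_stats(tokens):
    """Return (M, N, s, B) for a list of word tokens.

    M: vocabulary size, N: word frequencies N_i,
    s: selectivity s_i = 2 N_i / t_i, B: number of word types
    that occur at least once twice in a row.
    """
    freq = Counter(tokens)                 # N_i
    neighbours = defaultdict(set)          # distinct words seen beside each word
    repeated = set()                       # types repeated consecutively
    for left, right in zip(tokens, tokens[1:]):
        neighbours[left].add(right)
        neighbours[right].add(left)
        if left == right:
            repeated.add(left)
    # Words without any neighbour (one-word texts) are skipped.
    selectivity = {w: 2 * freq[w] / len(neighbours[w])
                   for w in freq if neighbours[w]}
    return len(freq), freq, selectivity, len(repeated)

tokens = "the cat saw the cat and the dog saw the the cat".split()
M, N, s, B = first_order_stats(tokens)
```

In this toy sentence only "the" is ever repeated consecutively, so `B` counts a single type even if that repetition were to occur many times.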



2. Network characterization 

Complex networks have been used to characterize texts [3],
where the nodes represent words and links are established
based on word co-occurrence, i.e. links between two nodes are
established if the corresponding words appear at least once
adjacent in the text. In most applications of co-occurrence
networks, the stopwords [27] are removed and the remaining
words are lemmatized [55]. Here, we decided not to do this
because in unknown languages it is impossible to derive
lemmatized word forms or identify stopwords. To characterize
the structure and organization of the networks, the following
topological metrics of complex networks were calculated (more
details are given in the Supplementary Information (SI)):

• We quantify degree correlations, i.e. the tendency
of nodes of certain degree to be connected to nodes
with similar degree (the degree $k$ of a node is the
number of links it has to other nodes), with the
Pearson correlation coefficient, $r$, thus distinguishing
assortative ($r > 0$) from disassortative ($r < 0$)
networks.

• The so-called clustering coefficient, $C_i$, is given
by the fraction of closed triangles of a node, i.e.
the number of actual connections between neighbours
of a node divided by the possible number of
connections between them. The global clustering
coefficient $C$ is the average over the local
coefficients of all nodes.

• The average shortest path length, $L_i$, is the shortest
path between two nodes $i$ and $j$ averaged over
all possible $j$'s. In text networks it measures the
relevance of words according to their distance to
the most frequent words [4].

• The diameter $d$ corresponds to the maximum shortest
path, i.e. the maximum distance on the network
between any two nodes.

• We also characterized the topology of the networks
through the analysis of motifs, i.e. analysis of
connectivity patterns expressed in terms of small
building blocks (or subgraphs) [17]. We define as $m_Y$ the
number of motifs $Y$ appearing in the network. The
motifs employed in the current paper are displayed
in Fig. 2.

FIG. 1: Illustration of the procedures performed to obtain a measurement X of each book.
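
Some of the metrics listed above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes an undirected, unweighted co-occurrence network, ignores self-loops created by consecutively repeated words, and averages shortest paths only over reachable nodes. All function names are hypothetical.

```python
from collections import defaultdict, deque

def cooccurrence_network(tokens):
    """Undirected graph linking words that appear adjacent at least once."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:                      # assumption: self-loops are ignored
            adj[a].add(b)
            adj[b].add(a)
    return adj

def clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nb = list(adj[node])
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nb[j] in adj[nb[i]])
    return 2.0 * links / (k * (k - 1))

def mean_shortest_path(adj, source):
    """Average breadth-first distance from `source` to the reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others) if others else 0.0

def assortativity(adj):
    """Pearson correlation between the degrees at the two ends of each link."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:                # every link is visited in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)  # equals the variance of ys by symmetry
    return cov / var if var else 0.0

adj = cooccurrence_network("a b c a d".split())
degree_of_a = len(adj["a"])
```

Because each link is counted in both directions, the two degree sequences in `assortativity` have the same variance, which is why a single variance suffices in the denominator.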

3. Intermittency 

The fact that words are unevenly distributed along texts has
been used to detect keywords in documents [5, 18, 19]. Since
bursty words appear concentrated in portions of the text, in
contrast to others that are distributed homogeneously along
the text, words with different functions can be distinguished.

The intermittency was calculated using the concept of
recurrence times, which have been used to quantify the
burstiness of time series. In the case of documents, the
time series of a word is taken by counting the number
of words (representing time) between successive appearances
of the considered word. For example, the recurrence times
for the word 'the' in the previous sentence are $T_1 = 4$,
$T_2 = 10$, and $T_3 = 11$. If $N_i$ is the frequency of the
word, its time series will be composed of the following
elements: $\{T_1, T_2, \ldots, T_{N_i-1}\}$. Because the times
until the first occurrence, $T_f$, and after the last
occurrence, $T_l$, are not considered, the element $T_{N_i}$ is
arbitrarily defined as $T_{N_i} = T_f + T_l$. Note that with the
inclusion of $T_{N_i}$ in the time series, the average value
over all $N_i$ values is $\langle T \rangle_i = N/N_i$, where
$N$ is the total number of words in the text. Then, to compute
the heterogeneity of the distribution of a word $i$ in the
text, we obtained the intermittency $I_i$ as

$$I_i = \frac{\sigma_T}{\langle T \rangle} = \sqrt{\frac{\langle T^2 \rangle}{\langle T \rangle^2} - 1}. \qquad (1)$$


FIG. 2: Illustration of 13 motifs comprising three nodes used 
to analyze the structure of text networks. 

Words distributed by chance have $I_i \approx 1$ (for $N_i \gg 1$),
while bursty words have $I_i > 1$. Words with $N_i < 5$ were
neglected since they lack statistics.
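
A minimal sketch of the recurrence-time calculation follows. It assumes that the intermittency is the coefficient of variation of the recurrence times (standard deviation divided by the mean), which is consistent with $I_i \approx 1$ for randomly placed words, and it closes the series with $T_f + T_l$ as described above. The function name is hypothetical, and the cutoff $N_i < 5$ is left to the caller.

```python
from math import sqrt

def intermittency(tokens, word):
    """Coefficient of variation of the recurrence times of `word`."""
    positions = [i for i, w in enumerate(tokens) if w == word]
    n_total = len(tokens)
    if len(positions) < 2:
        raise ValueError("need at least two occurrences of the word")
    # Recurrence times between successive occurrences.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    # Closing element T_f + T_l: words before the first occurrence plus
    # words from the last occurrence to the end of the text.
    gaps.append(positions[0] + n_total - positions[-1])
    mean = sum(gaps) / len(gaps)          # equals N / N_i by construction
    mean_sq = sum(g * g for g in gaps) / len(gaps)
    return sqrt(max(mean_sq / mean ** 2 - 1.0, 0.0))

regular = ["x", "a", "b"] * 4             # "x" evenly spaced
bursty = ["x"] * 4 + ["a"] * 8            # "x" concentrated at the start
i_regular = intermittency(regular, "x")
i_bursty = intermittency(bursty, "x")
```

An evenly spaced word has identical recurrence times and therefore zero intermittency, whereas the clustered word in `bursty` yields a value above 1.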

B. From word to text measurements 

Many of the measurements defined in the previous Section are
attributes of the word $i$. For our aims here it is essential
to compare different texts. The easiest and most
straightforward choice is to assign to a piece of text the
average value of each measurement $X_i$, computed over all $M$
words in the text, $X = M^{-1} \sum_{i=1}^{M} X_i$. This was
done for $L$, $C$, $I$, $k$ and $s$. One potential limitation
of this approach is that the same weight is attributed to each
word, regardless of its frequency in the text. To overcome
this, we also calculated another metric, $X^*$, obtained as the
average over the $\eta$ most frequent words, i.e.
$X^* = \eta^{-1} \sum_i X_i$, where the sum runs over the $\eta$
most frequent words. Here, we chose $\eta = 50$. Finally,
because $s$ is known to have a distribution with long tails
[16], we also computed the coefficient $\gamma_s$ of the power
law $P(s) \propto s^{-\gamma_s}$, estimated with a
maximum-likelihood methodology.
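
The two averages can be sketched as below. The helper name `text_level` and the dictionary-based inputs are assumptions made for illustration. Ties in frequency are broken by Python's stable sort, a detail the text does not specify.

```python
def text_level(values, freq, eta=50):
    """Return (X, X*) from word-level values X_i and frequencies N_i.

    X averages over all words; X* averages over the `eta` most frequent words.
    """
    words = list(values)
    x_all = sum(values[w] for w in words) / len(words)
    top = sorted(words, key=lambda w: freq[w], reverse=True)[:eta]
    x_top = sum(values[w] for w in top) / len(top)
    return x_all, x_top

values = {"a": 1.0, "b": 2.0, "c": 6.0}   # toy word-level measurement
freq = {"a": 10, "b": 5, "c": 1}          # toy frequencies
x, x_star = text_level(values, freq, eta=2)
```

With `eta=2` the rare word "c" is excluded from `x_star`, which shows how the two averages can differ when infrequent words carry extreme values.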



C. Comparison to shuffled texts 

Since we are interested in measurements capable of
distinguishing a meaningful text from its shuffled version,
each of the measurements $X$ and $X^*$ described above was
normalized by the average obtained over 10 texts produced
using a word shuffling process, i.e. randomizing the word
order while preserving the word frequencies. If
$\mu(X^{(R)})$ and $\sigma(X^{(R)})$ are respectively the
average and the deviation over 10 realizations of shuffled
texts, the normalized measurement $\tilde{X}$ and the
uncertainty $\epsilon(\tilde{X})$ related to $\tilde{X}$ are:



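$$\tilde{X} = \frac{X}{\mu(X^{(R)})}, \qquad (2)$$

$$\epsilon(\tilde{X}) = \frac{\sigma(X^{(R)})}{\mu(X^{(R)})}\,\tilde{X}. \qquad (3)$$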



Normalization by the shuffled text is useful because it
permits comparing each measurement with a null model.
Hence, a measurement provides significant information
only if its normalized value $\tilde{X}$ lies farther than
$\epsilon(\tilde{X})$ from $\tilde{X} = 1$. Moreover, the
influence of the vocabulary size $M$ on the other
measurements tends to be minimized.
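
The normalization against shuffled texts might be sketched as follows. Several points are assumptions rather than details taken from the paper: the measurement is passed in as a function of the token list, the uncertainty is obtained by propagating the spread of the shuffled values into the ratio, the sample standard deviation is used, and the names `normalise` and `distinct_bigrams` are hypothetical.

```python
import random
from statistics import mean, stdev

def distinct_bigrams(tokens):
    """Toy measurement: number of distinct ordered pairs of adjacent words."""
    return len(set(zip(tokens, tokens[1:])))

def normalise(tokens, measure, realisations=10, seed=0):
    """Return (X_norm, eps): X divided by its shuffled average, and its uncertainty."""
    rng = random.Random(seed)
    x = measure(tokens)
    shuffled_values = []
    for _ in range(realisations):
        copy = list(tokens)
        rng.shuffle(copy)                 # word frequencies are preserved
        shuffled_values.append(measure(copy))
    mu, sigma = mean(shuffled_values), stdev(shuffled_values)
    if mu == 0:
        raise ValueError("measurement vanishes in the shuffled texts")
    x_norm = x / mu
    # Assumed uncertainty: relative spread of the null model times X_norm.
    return x_norm, x_norm * sigma / mu

tokens = ["a", "b"] * 50                  # strictly alternating toy text
x_norm, eps = normalise(tokens, distinct_bigrams)
```

The alternating toy text has only two distinct bigrams, fewer than its shuffled versions, so its normalized value falls below 1.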



III. VARIABILITY ACROSS LANGUAGES AND TEXTS

The measurements described in Section II vary across texts
written in different languages owing to the syntactic
properties of each language. In a given language, there is
also an obvious variation among texts on account of stylistic
and semantic factors. Thus, in a first approximation one may
assume that variations across texts of a measurement $X$
occur in two dimensions. Let $X_{t,l}$ denote the value of $X$
for text $t$ written in language $l$. If we had access to the
complete matrix $X_{t,l}$, i.e. if all possible texts in every
possible language could be analyzed, we could simply compare
a new text $t$ to the full variation of the measurements
$X_{t,l}$ in order, e.g., to determine which languages
$\lambda$ the text is compatible with. In practice, we can at
best have some rows and columns filled and therefore
additional statistical tests are needed in order to
characterize the variation of specific measurements. For
different texts, $P(X_{t,l=\lambda})$ denotes the distribution
of measurement $X$ across different texts in a fixed language
$l = \lambda$ and $P(X_{t=\tau,l})$ the distribution of $X$
across a fixed text $t = \tau$ written in various languages.
Accordingly, $\mu(P)$ and $\sigma(P)$ represent the
expectation and the variation of the distribution $P$. For
concreteness, Fig. 3 illustrates the distribution of $X = B$
(number of duplicated bigrams) for the three sets of texts
we use in our analysis: 15 books in Portuguese, 15 books
in English, and 15 versions of the New Testament in
different languages (see SI for details). We consider also the
average $\langle X \rangle$ and the standard deviation
$\sigma(X)$ of $X$ computed over different books (e.g., each
of the three sets of 15 books) and the correlation $R_M$
between $X$ and the vocabulary size $M$ of the book. Table I
shows the values of $\langle X \rangle$, $\sigma(X)$ and $R_M$
of all measurements in each of the three sets of books. In
order to obtain further insights on the dependence of these
measurements on language (syntax) and text (semantics), next
we perform additional statistical analysis to identify
measurements that are more suitable to target specific
problems.


FIG. 3: Distribution of $X = B$ for the New Testament (black
circles), English (red circles) and Portuguese (blue circles)
texts. The average $\langle X \rangle$ for the three sets of
texts is represented as dashed lines.



A. Distinguishing books from shuffled sequences 

Our first aim is to identify measurements capable of 
distinguishing between natural and shuffled texts, which 
will be referred to as informative measurements. For
instance, for $X = B$ in Fig. 3 all values are much smaller
than 1 in all three sets of texts, indicating that this 
measurement takes smaller values in natural texts than 
TABLE I: Verification of which measurements satisfy conditions $\zeta_1$, $\zeta_2$, $\zeta_2'$ and $\zeta_3$. $R_M$ is the Pearson correlation between $X$
and the vocabulary size $M$. We assume that $\zeta_1$, $\zeta_2$, $\zeta_2'$ and $\zeta_3$ are satisfied respectively when $p = 0.00\,\%$, $v_{t=\mathrm{new},l} > v_{t,l=\lambda}$,
$\iota(v_{t=\tau,l}) \cap \iota(v_{t,l=\lambda}) < 0.05\,\iota(v_{t=\tau,l}) \cup \iota(v_{t,l=\lambda})$ and $c(X_{t=\mathrm{new},l=\lambda}, P(X_{t,l=\lambda})) > 0.05$. Measurements satisfying conditions for all
three sets of texts are marked with a filled circle (•).

| $X$ | $\langle X\rangle \pm \sigma(X)$, $\tau$ = new | $\langle X\rangle \pm \sigma(X)$, $\lambda$ = en | $\langle X\rangle \pm \sigma(X)$, $\lambda$ = pt | $p(\tilde{X}=1,\{\tilde{X}\})$, $\tau$ = new | $p$, $\lambda$ = en | $p$, $\lambda$ = pt | $v_{t=\mathrm{new},l}/v_{t,l=\lambda}$, $\lambda$ = en | $v_{t=\mathrm{new},l}/v_{t,l=\lambda}$, $\lambda$ = pt | $c(X,P(X))$, $\lambda$ = en | $c(X,P(X))$, $\lambda$ = pt | $R_M$ | $\zeta_2'$ | $\zeta_3$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $M$ | 5,809 ± 2,665 | 4,720 ± 922 | 6,921 ± 1,126 | | | | 3.12 | 2.82 | 0.00 | 0.00 | +1.00 | • | |
| | 1.99 ± 0.11 | 1.93 ± 0.06 | 2.01 ± 0.09 | | | | 1.71 | 1.25 | 0.00 | 0.00 | +0.86 | | |
| $r$ | 0.91 ± 0.10 | 1.10 ± 0.06 | 1.15 ± 0.04 | 0.00 % | 0.00 % | 0.00 % | 2.18 | 3.41 | 0.07 | 0.14 | +0.07 | • | • |
| $d$ | 1.44 ± 0.58 | 1.32 ± 0.38 | 1.07 ± 0.14 | 12.50 % | 37.50 % | 43.75 % | 1.41 | 3.16 | 0.00 | 0.00 | +0.08 | | |
| $L$ | 1.04 ± 0.05 | 0.99 ± 0.02 | 0.97 ± 0.01 | 12.50 % | 0.00 % | 0.00 % | 2.07 | 7.57 | 0.76 | 0.68 | +0.20 | • | • |
| $L^*$ | 1.08 ± 0.04 | 1.04 ± 0.02 | 1.03 ± 0.01 | 0.00 % | 0.00 % | 0.00 % | 2.23 | 2.91 | 0.80 | 0.51 | +0.34 | • | • |
| $C$ | 0.83 ± 0.13 | 0.97 ± 0.04 | 0.97 ± 0.03 | 0.00 % | 18.75 % | 25.00 % | 3.31 | 4.74 | 0.65 | 0.62 | -0.34 | • | • |
| $C^*$ | 0.66 ± 0.13 | 0.65 ± 0.08 | 0.63 ± 0.07 | 0.00 % | 0.00 % | 0.00 % | 1.52 | 1.71 | 0.91 | 0.80 | -0.58 | | • |
| $I$ | 1.30 ± 0.07 | 1.29 ± 0.14 | 1.27 ± 0.06 | 0.00 % | 0.00 % | 0.00 % | 0.47 | 1.03 | 0.59 | 0.45 | -0.43 | | • |
| $I^*$ | 1.32 ± 0.05 | 1.32 ± 0.14 | 1.26 ± 0.09 | 0.00 % | 0.00 % | 0.00 % | 0.36 | 0.75 | 0.77 | 0.95 | -0.26 | • | • |
| $B$ | 0.18 ± 0.15 | 0.05 ± 0.04 | 0.10 ± 0.05 | 0.00 % | 0.00 % | 0.00 % | 1.01 | 11.4 | 0.95 | 0.32 | +0.27 | | • |
| $k$ | 0.71 ± 0.06 | 0.82 ± 0.03 | 0.87 ± 0.02 | 0.00 % | 0.00 % | 0.00 % | 1.44 | 3.99 | 0.00 | 0.01 | +0.53 | • | |
| $k^*$ | 0.71 ± 0.07 | 0.89 ± 0.05 | 1.00 ± 0.04 | 0.00 % | 0.00 % | 12.50 % | 1.93 | 2.81 | 0.01 | 0.01 | +0.26 | • | |
| $\gamma_s$ | 0.43 ± 0.14 | 0.51 ± 0.06 | 0.47 ± 0.07 | 0.00 % | 0.00 % | 0.00 % | 2.53 | 2.26 | 0.88 | 0.69 | -0.49 | • | • |
| $s$ | 1.32 ± 0.18 | 1.13 ± 0.03 | 1.07 ± 0.02 | 0.00 % | 0.00 % | 0.00 % | 5.06 | 8.30 | 0.05 | 0.25 | -0.51 | • | |
| $s^*$ | 2.09 ± 0.84 | 1.47 ± 0.08 | 1.33 ± 0.10 | 0.00 % | 0.00 % | 0.00 % | 7.18 | 5.60 | 0.48 | 0.62 | -0.39 | • | • |
| $m_A$ | 0.09 ± 0.04 | 0.12 ± 0.04 | 0.17 ± 0.04 | 0.00 % | 0.00 % | 0.00 % | 1.31 | 1.85 | 0.00 | 0.00 | +0.02 | | |
| $m_B$ | 1.11 ± 0.37 | 1.54 ± 0.11 | 1.72 ± 0.07 | 0.00 % | 0.00 % | 0.00 % | 3.75 | 7.67 | 0.00 | 0.00 | -0.09 | • | |
| $m_C$ | 0.83 ± 0.21 | 1.19 ± 0.10 | 1.28 ± 0.05 | 18.75 % | 0.00 % | 0.00 % | 2.30 | 6.04 | 0.00 | 0.00 | +0.04 | • | |
| $m_D$ | 0.22 ± 0.09 | 0.27 ± 0.11 | 0.37 ± 0.06 | 0.00 % | 0.00 % | 0.00 % | 0.97 | 2.45 | 0.00 | 0.00 | +0.24 | | |
| $m_E$ | 0.76 ± 0.18 | 1.27 ± 0.16 | 1.03 ± 0.06 | 12.50 % | 6.25 % | 18.75 % | 1.66 | 0.72 | 0.00 | 0.00 | -0.23 | | |
| $m_F$ | 0.24 ± 0.07 | 0.37 ± 0.05 | 0.39 ± 0.06 | 0.00 % | 0.00 % | 0.00 % | 1.87 | 1.80 | 0.00 | 0.00 | -0.20 | | |
| $m_G$ | 0.36 ± 0.14 | 0.47 ± 0.09 | 0.56 ± 0.05 | 0.00 % | 0.00 % | 0.00 % | 1.82 | 4.43 | 0.00 | 0.00 | +0.14 | | |
| $m_H$ | 0.71 ± 0.24 | 1.25 ± 0.11 | 1.16 ± 0.11 | 0.00 % | 0.00 % | 0.00 % | 2.67 | 3.66 | 0.00 | 0.00 | -0.17 | • | |
| $m_I$ | 0.20 ± 0.07 | 0.32 ± 0.05 | 0.36 ± 0.05 | 0.00 % | 0.00 % | 0.00 % | 1.68 | 2.48 | 0.00 | 0.00 | -0.14 | | |
| $m_J$ | 0.45 ± 0.17 | 0.57 ± 0.12 | 0.73 ± 0.05 | 0.00 % | 0.00 % | 0.00 % | 1.76 | 5.19 | 0.00 | 0.00 | +0.11 | | |
| $m_K$ | 0.59 ± 0.25 | 1.22 ± 0.16 | 1.02 ± 0.08 | 0.00 % | 12.50 % | 18.75 % | 2.55 | 5.29 | 0.00 | 0.00 | -0.24 | • | |
| $m_L$ | 0.03 ± 0.02 | 0.04 ± 0.02 | 0.06 ± 0.02 | 0.00 % | 0.00 % | 0.00 % | 1.53 | 1.85 | 0.04 | 0.35 | +0.10 | | |
| $m_M$ | 0.26 ± 0.10 | 0.39 ± 0.06 | 0.46 ± 0.08 | 0.00 % | 0.00 % | 0.00 % | 2.11 | 2.16 | 0.00 | 0.00 | -0.14 | • | |

in shuffled texts. In order to quantify the distance of
a set of values $\{\tilde{X}\}$ to $\tilde{X} = 1$ we define the
quantity $p(\tilde{X} = 1, \{\tilde{X}\})$ as the proportion of
elements in the set $\{\tilde{X}\}$ for which $\tilde{X} = 1$
lies within the interval $\tilde{X} \pm \epsilon(\tilde{X})$,
where $\epsilon(\tilde{X})$ arises from fluctuations due to the
randomness of the shuffling process as defined in Eq. (3). This
leads to condition $\zeta_1$:

$\zeta_1$: $X$ is said to be informative if $p(\tilde{X} = 1, \{\tilde{X}\}) \to 0$
for $|\{\tilde{X}\}| \to \infty$,

where $\{\tilde{X}\}$ is a set of values of $\tilde{X}$ obtained
over different texts or different languages, and
$|\{\tilde{X}\}|$ is the number of elements in this set.
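
As a small illustration, the proportion used in this condition can be computed from pairs of normalized values and their uncertainties. The function name and the input format are hypothetical, and the numbers below are invented for the example rather than taken from the paper.

```python
def proportion_compatible_with_shuffled(pairs):
    """Fraction of (x_norm, eps) pairs whose interval x_norm +/- eps contains 1."""
    pairs = list(pairs)
    hits = sum(1 for x_norm, eps in pairs if abs(x_norm - 1.0) <= eps)
    return hits / len(pairs)

# Invented (x_norm, eps) values for four texts.
toy = [(0.18, 0.05), (1.02, 0.05), (1.30, 0.10), (0.97, 0.04)]
p = proportion_compatible_with_shuffled(toy)
```

A measurement would count as informative in this sketch only when the returned proportion is zero for the set of texts considered.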

We now discuss the results obtained applying $\zeta_1$ (with
$p(\tilde{X} = 1, \{\tilde{X}\}) = 0$) for all three sets of
texts in our database for each of the measurements described
in Sec. II. Measurements which satisfied $\zeta_1$ are
indicated by a • in Tab. I. Several of the network
measurements ($d$, $L$, $C$, $k^*$ and motifs $m_C$, $m_E$ and
$m_K$) do not fully satisfy $\zeta_1$. Consequently, they
cannot be used to distinguish a manuscript from its shuffled
version. This finding is rather surprising because some of the
latter measurements were proven useful to grasp subtleties in
text, e.g. for author recognition [3]. In the latter
application, however, the networks representing text did not
contain stopwords and the texts were lemmatized. The
averaging over the 50 most frequent words seems to be
essential to satisfy $\zeta_1$ for the clustering coefficient
and for the shortest paths (note that $C^*$ and $L^*$ are
informative while $C$ and $L$ are not). This means that the
informativeness of these quantities is concentrated in the
most frequent words. On the other hand, for the degree an
opposite effect occurs, i.e., $k$ is informative and $k^*$ is
not. The informativeness of intermittency ($I$ and $I^*$) may
be explained by the fact that, by construction, $I_i \approx 1$
in shuffled texts (see Sec. II A 3). Because in natural texts
many words tend to appear clustered in regions, $I > 1$ and
$I^* > 1$. The selectivity $s$ is also strongly affected by the
shuffling process. Words in shuffled texts tend to be less
selective, which yields an increase in $\gamma_s$ [16] (i.e.,
very selective words occur very sporadically) and a decrease
in $s$ and $s^*$. The selectivity is related to the effect of
word consistency [21], which was verified to be common in
English, especially for very frequent words. The number of
repeated bigrams $B$ is also informative, which means that in
natural languages it is unlikely that the same word is
repeated (when compared with random texts). As for the
informative motifs, $m_A$, $m_D$, $m_F$, $m_G$, $m_I$, $m_J$,
$m_L$ and $m_M$ rarely occur in natural language texts
($\langle X \rangle < 1$), while a single motif was the only
measurement taking values above and below 1. The emergence
of this motif therefore appears to depend on the syntax,
being very rare for Xhosa, Vietnamese, Swahili, Korean,
Hebrew and Arabic.



B. Dependence on style and language



B. Dependence on style and language 

We are now interested in investigating which text
measurements are more dependent on the language than on the
style of the book, and vice versa. Measurements depending
predominantly on the syntax are expected to have larger
variability across languages than across texts. On the other
hand, measurements depending mainly on the story (semantics)
being told are expected to have larger variability across
texts in the same language, i.e. $t = \tau$ [30]. The
variability of the measurements was computed with the
coefficient of variation $v = \sigma(X)/\langle X \rangle$,
where $\sigma(X)$ and $\langle X \rangle$ represent
respectively the standard deviation and the average computed
for the books in the set $\{X\}$. Thus, we may assume that $X$
is more dependent on the language than on the style/semantics
if condition $\zeta_2$ is satisfied:

$\zeta_2$: $X$ is more dependent on the language (or syntax)
than it is on the style (or semantics) if
$v_{t=\tau,l} > v_{t,l=\lambda}$.

Measurements failing to comply with condition $\zeta_2$ have
$v_{t,l=\lambda} > v_{t=\tau,l}$ and therefore are more
dependent on the style/semantics than on the language/syntax.
In order to quantify whether $v_{t=\tau,l} > v_{t,l=\lambda}$
or $v_{t,l=\lambda} > v_{t=\tau,l}$ is statistically
significant, we took the confidence intervals of
$v_{t=\tau,l}$ and $v_{t,l=\lambda}$. Let $\iota(v)$ be the
confidence interval for $v$ computed using the noncentral
t-distribution [22]; then $\zeta_2'$ is valid if there is
little intersection of the confidence intervals. In other
words:

$\zeta_2'$: The inequality $v_{t=\tau,l} > v_{t,l=\lambda}$ (or
$v_{t,l=\lambda} > v_{t=\tau,l}$) is valid only if
$\iota(v_{t=\tau,l}) \cap \iota(v_{t,l=\lambda}) \to \emptyset$
for $|\{X\}| \to \infty$.

The confidence intervals were assumed to have little
intersection if $\iota(v_{t=\tau,l}) \cap \iota(v_{t,l=\lambda})
< 0.05 \times \iota(v_{t=\tau,l}) \cup \iota(v_{t,l=\lambda})$.
We took a significance level $\alpha = 0.95$ in the
construction of the confidence intervals.
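
The two ingredients of this test can be sketched as follows. The confidence intervals themselves are assumed to be computed elsewhere (the noncentral t-distribution is not implemented here), the sample standard deviation is an assumption, and both function names are hypothetical. Interval sizes are taken to be interval lengths.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """v = sigma(X) / <X>, using the sample standard deviation (an assumption)."""
    return stdev(values) / mean(values)

def little_intersection(ci_a, ci_b, threshold=0.05):
    """True if the overlap of two intervals is below `threshold` times their union."""
    (a_lo, a_hi), (b_lo, b_hi) = ci_a, ci_b
    inter = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = (a_hi - a_lo) + (b_hi - b_lo) - inter
    return inter < threshold * union

v = coefficient_of_variation([1.0, 2.0, 3.0])
disjoint = little_intersection((0.1, 0.2), (0.3, 0.5))
overlapping = little_intersection((0.1, 0.4), (0.3, 0.5))
```

Disjoint intervals always pass the test, while intervals sharing a quarter of their union, as in the last line, do not.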

The results for the measurements satisfying conditions
$\zeta_2$ and $\zeta_2'$ are shown in Tab. I. Measurements
satisfying conditions $\zeta_2$ and $\zeta_2'$ serve to examine
the dependency on the syntax or on the style/semantics. The
vocabulary size $M$, and the network measurements $r$, $L$,
$L^*$, $C$, $k$ and $k^*$ are more dependent on syntax than on
semantics. The measurements derived from the selectivity
($\gamma_s$, $s$ and $s^*$) are also strongly dependent on the
language. With regard to the motifs, five of them satisfy
$\zeta_2$ and $\zeta_2'$: $m_B$, $m_C$, $m_H$, $m_K$ and $m_M$.
Remarkably, $I$ and $I^*$ are the only measurements with low
values of $v_{t=\mathrm{new},l}/v_{t,l=\lambda}$. Reciprocally,
the only measurement which statistically significantly
violated $\zeta_2$ (i.e., satisfied $\zeta_2'$) was $I^*$. This
confirms that the average intermittency of the most frequent
words is more dependent on the style than on the language.



C. On the representativeness of measurements 

The practical implementation of our general framework was
done quantifying the variation across languages using a
single book (the New Testament). This was done because of the
lack of available books in a large number of languages. In
order for this approach to work it is essential to determine
whether fluctuations across different languages are
representative of the fluctuations observed in different
books. We now try to determine the measurements $X$ whose
actual values for a single book in a specific language
$\lambda$ ($X_{t=\mathrm{new},l=\lambda}$) are compatible with
other books in the same language ($X_{t,l=\lambda}$). To this
end we define the compatibility $c(X,P)$ of
$X_{t=\mathrm{new},l=\lambda}$ to $P(X_{t,l=\lambda})$. The
distribution $P$ was obtained with the Parzen-windowing
interpolation [23] using a Gaussian function as kernel. More
precisely, $P$ was constructed adding Gaussian distributions
centered around each $X$ observed over different texts in a
fixed language $\lambda$. Mathematically, the compatibility
$c(X,P)$ is computed as

$$c(X,P) = \begin{cases} 2\int_{-\infty}^{X} P(X')\,dX' & \text{if } X \le X_{\mathrm{median}}, \\ 2\int_{X}^{\infty} P(X')\,dX' & \text{if } X > X_{\mathrm{median}}, \end{cases} \qquad (4)$$

where $X_{\mathrm{median}}$ is the median of $P(X)$. For
practical purposes, we consider that
$X_{t=\mathrm{new},l=\lambda}$ is compatible with other books
written in the same language $\lambda$ if $\zeta_3$ is
fulfilled:

$\zeta_3$: $X_{t=\mathrm{new},l}$ is a representative measurement
of the language $\lambda$ if
$c(X_{t=\mathrm{new},l=\lambda}, P(X_{t,l=\lambda})) > 0.05$.

The representativeness of the measurements computed for the
New Testament was checked using the distribution $P(X)$
obtained from the set of books written in Portuguese and
English. The standard deviation employed in the Parzen method
was the worst deviation between English and Portuguese, i.e.
$\sigma = \min\{\sigma_{\mathrm{pt}}, \sigma_{\mathrm{en}}\}$.
The measurements satisfying $\zeta_3$ for both English and
Portuguese datasets are displayed in the last column of
Tab. I. With regard to the network measurements, only $L$,
$L^*$, $C$ and $C^*$ are representative, suggesting that they
are weakly dependent on the variation of style (obviously
assuming the New Testament as a reference). In addition, $I$,
$I^*$, $\gamma_s$ and $s^*$ turned out to be representative
measurements, as did one of the motif counts.
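
The compatibility measure can be sketched with the standard library alone. The sketch assumes an equal-weight Gaussian mixture with a common width `sigma` as the Parzen estimate, and it uses the fact that a value lies at or below the median exactly when the cumulative distribution there is at most 0.5, so the median never has to be computed explicitly. Function names and sample values are hypothetical.

```python
from math import erf, sqrt

def parzen_cdf(x, samples, sigma):
    """Cumulative distribution of Gaussians of width `sigma` centred on the samples."""
    total = sum(0.5 * (1.0 + erf((x - s) / (sigma * sqrt(2.0)))) for s in samples)
    return total / len(samples)

def compatibility(x, samples, sigma):
    """Two-sided tail mass of x under the Parzen estimate built from `samples`."""
    cdf = parzen_cdf(x, samples, sigma)
    # cdf <= 0.5 means x lies at or below the median of the estimate.
    return 2.0 * cdf if cdf <= 0.5 else 2.0 * (1.0 - cdf)

books = [1.0, 2.0, 3.0]                    # invented values of X for one language
central = compatibility(2.0, books, 0.5)   # value at the centre of the estimate
outlier = compatibility(10.0, books, 0.5)  # value far in the upper tail
```

A value at the centre of the estimate gives a compatibility close to 1, while a value far in either tail falls below the 0.05 threshold used in the text.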



IV. CASE STUDY: THE VOYNICH MANUSCRIPT (VMS)

So far we have introduced a framework for identifying the
dependency of different measurements on the language and
story of different books. We now investigate to which extent
the measurements we identified as relevant can provide
information on the analysis of single texts. The Voynich
Manuscript (VMS), named after the book dealer Wilfrid Voynich
who bought the book in the early 20th century, is a 240-page
folio that dates back to the 15th century. Its mysterious
aspect has captivated people's attention for centuries.
Indeed, the VMS has been studied by professional
cryptographers, being a challenge to scholars and decoders
[24, 25], and is currently included among the six most
important ciphers [24]. The various hypotheses about the VMS
can be summarized into three categories: (i) a sequence of
words without a meaningful message; (ii) a meaningful text
written originally in an existing language which was coded
(and possibly encrypted) in the Voynich alphabet; and (iii) a
meaningful text written in an unknown (possibly constructed)
language. While it is impossible to investigate
systematically all these hypotheses, here we perform a number
of statistical analyses that aim at clarifying the
feasibility of each of these scenarios. To address point (i)
we analyze shuffled texts. To address point (ii) we consider
15 different languages, including the artificial language
Esperanto, which allows us to touch on point (iii) too. We do
not consider the effect of encryption of the text.

The statistical properties of the VMS were obtained to try
and answer the questions posed in Tab. II, which required
checking the measurements that would lead to statistically
significant results. To check whether a given text is
compatible with its shuffled version, $\tilde{X}$ computed in
texts written in natural languages should always be far from
$\tilde{X} = 1$, and therefore only informative measurements
are able to answer question Q1. To test whether a text is
consistent with some natural language (question Q2), the
texts employed as basis for comparison (i.e., the New
Testament) should be representative of the language.
Accordingly, condition $\zeta_3$ must be satisfied when
selecting suitable measurements to answer Q2. $\zeta_2$ and
$\zeta_2'$ must be satisfied for measurements suitable to
answer Q3 because the variance in style within a language
should be small, if one wishes to determine the most similar
language. Otherwise, an outlier text in terms of style could
be taken as belonging to another language. An analogous
reasoning applies to selecting measurements to identify the
closest style. Finally, note that answers for Q3 and Q4
depend on a comparison with the New Testament in our dataset.
Hence, suitable measurements must fulfill condition $\zeta_3$
in order to ensure that the measurements computed for the New
Testament are representative of the language.



A. Is the VMS distinguishable from its shuffled 
text? 

Before checking the compatibility of the VMS with shuffled
texts, we verified if Q1 can be accurately answered in a set
of books written in Portuguese and English, henceforth
referred to as test dataset (see SI-Tab. 3). A given test text
was considered as not shuffled if the interval from
$\tilde{X} - \epsilon(\tilde{X})$ to
$\tilde{X} + \epsilon(\tilde{X})$ does not include
$\tilde{X} = 1$. To quantify the distance of a text from its
shuffled version, we defined the distance $D$:

$$D = \frac{|\tilde{X} - 1|}{\epsilon(\tilde{X})}, \qquad (5)$$

which quantifies how many $\epsilon$'s the value $\tilde{X}$ is
far from $\tilde{X} = 1$. As one should expect, the values of
$\tilde{X}$ computed in the test dataset for $\lambda$ = pt and
$\lambda$ = en (see SI-Tab. 4) indicate that none of the texts
is compatible with its shuffled version because $D > 1$, which
means that the interval from $\tilde{X} - \epsilon(\tilde{X})$
to $\tilde{X} + \epsilon(\tilde{X})$ does not include
$\tilde{X} = 1$. Once the methodology appropriately classified
the texts in the test dataset as incompatible with their
shuffled versions, we are now in a position to apply it to
the VMS.
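
A one-function sketch of this check is given below. It assumes that the distance is the absolute deviation of the normalized measurement from 1 expressed in units of its uncertainty, which is what the description in terms of "how many $\epsilon$'s" suggests. The function name and the numbers are invented for illustration.

```python
def distance_from_shuffled(x_norm, eps):
    """Number of uncertainties eps separating x_norm from the shuffled value 1."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return abs(x_norm - 1.0) / eps

d_far = distance_from_shuffled(1.30, 0.10)    # interval 1.20..1.40 excludes 1
d_near = distance_from_shuffled(0.98, 0.05)   # interval 0.93..1.03 includes 1
```

Values above 1 correspond to intervals that exclude the shuffled reference, as in the first call, whereas the second call gives a value below 1.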

The values of $\tilde{X}$ for the VMS, denoted as
$\tilde{X}_{\mathrm{VMS}}$ in Tab. III, indicate that the VMS
is not compatible with shuffled texts, because the interval
from $\tilde{X}_{\mathrm{VMS}} - \epsilon(\tilde{X}_{\mathrm{VMS}})$
to $\tilde{X}_{\mathrm{VMS}} + \epsilon(\tilde{X}_{\mathrm{VMS}})$
does not include $\tilde{X} = 1$. For all but one measurement
($C^*$), $\tilde{X} = 1$ lies outside the interval
$\tilde{X}_{\mathrm{VMS}} \pm \epsilon(\tilde{X}_{\mathrm{VMS}})$,
suggesting that the word order in the VMS is not established
by chance. The property of the VMS that is most
distinguishable from shuffled texts was determined
quantitatively using the distance $D_{\mathrm{VMS}}$ from
Eq. (5). Tab. III shows the largest distances for
intermittency ($I$ and $I^*$) and network measurements ($k$
and $L^*$). Because intermittency is strongly affected by
stylistic/semantic aspects and network measurements are
mainly influenced by syntactic factors, we take these results
to mean that the VMS is not compatible with shuffled,
meaningless texts.

B. Is the VMS compatible with a text in natural
languages?

The compatibility with natural languages was checked
by comparing the suitable measurements for the VMS
with those for the New Testament written in 15 languages.
As in the analysis of compatibility with shuffled
texts, we first validated our strategy in the test dataset as
TABLE II: The conditions that must be fulfilled by the measurements for answering each of the questions posed. For Q1, X
should not be close to X = 1 because X ≈ 1 in shuffled texts. In the case of Q3, it is desirable that there is no intersection
between the measurements computed for books belonging to different languages. Therefore C,2 and C,2 should be fulfilled. To 
find the closest style, the measurement must be strongly dependent on style, i.e. only ("2 should be fulfilled. Finally, if a question 
involves a comparison of the unknown manuscript with the New Testament then it requires that the measurements employed 
are representative. Therefore, Q2, Q3 and Q4 require the fulfillment of condition C3.



    Questions                                         C1    C2          C3
Q1  Is the text compatible with its shuffled version? •
Q2  Is the text compatible with a natural language?                     •
Q3  Which language is closer to the manuscript?             •     •     •
Q4  Which style is closer to the manuscript?                      •     •

TABLE III: Values of X for the Voynich Manuscript considering only the informative measurements (i.e., the measurements
satisfying C1). Apart from C*, all measurements point to the VMS being different from shuffled texts.



X     X_VMS − ε(X_VMS)   X_VMS   X_VMS + ε(X_VMS)   D_VMS
L*    1.069              1.071   1.072              47
C*    0.981              0.999   1.017
I     1.423              1.433   1.443              44
I*    1.875              1.890   1.904              61
B     2.333              2.637   2.940              5
k     0.948              0.949   0.950              51
Is    0.617              0.692   0.768              23
mo    0.782              0.796   0.809              15
mp    0.738              0.751   0.765              18
mj    0.784              0.798   0.813              14
mo    0.908              0.940   0.971              2
mi    0.724              0.733   0.741              32
mM    0.783              0.801   0.819              11
mA    0.728              0.739   0.751              23
mL    0.549              0.582   0.616              12

follows. The compatibility with natural texts was computed
using Eq. (4), where P was computed from the
New Testament dataset. The standard deviation of each
Gaussian representing a book in the test dataset should
be proportional to the variation of X across different
texts, and therefore we used the worst σ between English
and Portuguese. The values displayed in SI-Tab. 5 reveal
that all books are compatible with natural texts, as one
should expect. Therefore we have good indications that the
proposed strategy is able to properly decide whether a
text is compatible with natural languages.



The distance from the VMS to the natural languages
was estimated by obtaining the compatibility
c(X_VMS, P(X_{t=new,λ})) (see Eq. (4)). In this case, P
was constructed by adding Gaussian distributions centered
around each X observed in the New Testament over
the different languages λ. The distribution P for three
measurements is illustrated in Fig. 4. The values of
c(X_VMS, P(X_{t=new,λ})) displayed in Tab. IV confirm that
the VMS is compatible with natural languages for most of
the measurements suitable to answer Q2. The exceptions
were B and I*. A large B is a particular feature of
the VMS because the number of duplicated bigrams is much
greater than expected by chance, unlike in natural languages.
I* is higher for the VMS than typically observed
in natural languages (see Fig. 4(a)), even though the absolute
intermittency value of the most frequent words in
the VMS is not far from those for natural languages. Since
the intermittency I is related to the large-scale distribution
of a (key)word in the text, we speculate that the reason
for these observations may be the fact that the VMS is
a compendium of different topics.
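
The construction of P as a sum of Gaussians, one per language, can be sketched as follows. Eq. (4) is not reproduced in this section, so the compatibility used here (the two-sided tail mass of P beyond the observed value) is an assumption made purely for illustration, as are the function names and the use of a single common σ.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, centers, sigma):
    """CDF of P: equal-weight Gaussians centered on each language's X."""
    return sum(normal_cdf(x, mu, sigma) for mu in centers) / len(centers)


def compatibility(x, centers, sigma):
    """Two-sided tail mass of P beyond x (assumed stand-in for Eq. (4))."""
    lower = mixture_cdf(x, centers, sigma)
    return 2.0 * min(lower, 1.0 - lower)


# Hypothetical values of X for three languages, and two test values.
centers = [0.9, 1.0, 1.2]
print(compatibility(1.05, centers, sigma=0.1))  # inside the bulk of P
print(compatibility(2.00, centers, sigma=0.1))  # far in the tail of P
```

Under this assumed definition, a value deep in the tail of P yields a compatibility close to zero, which mirrors the threshold c > 0.05 quoted in the caption of Fig. 4.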



TABLE IV: Compatibility of VMS with natural languages.
Except for I* and B, the measurements computed for VMS
are consistent with those expected for texts written in natural
languages.

X    r     L     L*    C     C*    I     I*    B     s*    Is
c    0.14  0.62  0.99  0.96  0.05  0.39  0.00  0.00  0.09  0.12



C. Which language/style is closer to the VMS? 

We address this question in full generality, but we shall
show that, with the limited dataset employed, we cannot
obtain a faithful prediction of the language of a
manuscript. Given a text t, we identify the most similar
language according to the following procedure. We first
calculate the Euclidean distance (using the z-normalized
values of the measurements suitable to answer Q3 in
Tab. II) between the book under analysis and the versions
of the New Testament. Let R_{λ,t} be the ranking obtained
by language λ in the text t. Given a set of texts
T written in the same language, this procedure yields a
list of R_{λ,t} for each t ∈ T. In this case, it is useful to
combine the different R_{λ,t} by considering the product of
the normalized ranks

$$S_\lambda = \prod_{t \in T} \frac{R_{\lambda,t}}{|T|}, \qquad (6)$$

where |T| is the number of texts in the database T. This
choice is motivated by the fact that R_{λ,t}/|T| corresponds
to the probability of achieving by chance a rank as good
as R_{λ,t}, so that S_λ in Eq. (6) corresponds to the probability
of obtaining such a ranking by chance in every
single case. By ranking the languages according to S_λ we
FIG. 4: Distribution of measurements for the New Testament compared with the measurement obtained for VMS (dotted
line). The measurements are (a) X = I* (intermittency of the most frequent words); (b) X = r (assortativity) and (c) X = L
(average shortest path length). While in (a) VMS is not compatible with natural languages, in (b) and (c) the compatibility
was verified since c(X_VMS, P) > 0.05.



obtain a rank of best candidates for the language of the
texts in T.
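
The ranking procedure can be sketched as below. This is a minimal illustration under stated assumptions: the z-normalization is taken over the reference versions of the New Testament (the text does not specify the reference set), ties in distance are broken by sort order, and all names and numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(reference, text_vectors):
    """Z-normalize each measurement with the mean/std of the reference set (assumption)."""
    dims = range(len(next(iter(reference.values()))))
    mu = [mean(v[d] for v in reference.values()) for d in dims]
    sd = [pstdev(v[d] for v in reference.values()) or 1.0 for d in dims]

    def norm(v):
        return [(v[d] - mu[d]) / sd[d] for d in dims]

    return {k: norm(v) for k, v in reference.items()}, [norm(v) for v in text_vectors]


def language_ranks(reference, text_vector):
    """Rank languages (1 = closest) by Euclidean distance to a single text."""
    dist = {lang: math.dist(vec, text_vector) for lang, vec in reference.items()}
    ordered = sorted(dist, key=dist.get)
    return {lang: pos + 1 for pos, lang in enumerate(ordered)}


def combined_scores(reference, text_vectors):
    """Product over texts of rank / |T|, following Eq. (6); smaller is better."""
    ref_z, texts_z = z_normalize(reference, text_vectors)
    n_texts = len(texts_z)
    scores = {lang: 1.0 for lang in ref_z}
    for vec in texts_z:
        for lang, rank in language_ranks(ref_z, vec).items():
            scores[lang] *= rank / n_texts
    return scores


# Hypothetical two-measurement vectors for three reference languages and three texts.
reference = {"pt": [0.0, 0.0], "en": [1.0, 1.0], "de": [2.0, 2.0]}
texts = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.3]]
scores = combined_scores(reference, texts)
print(sorted(scores, key=scores.get))  # languages ordered from best to worst candidate
```

Multiplying the normalized ranks rewards a language that ranks well for every text in T, which is the behaviour the probability interpretation of Eq. (6) relies on.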

In our control experiments with |T| = 15 known texts
we verified that the measurements suitable to answer Q3
led to results for the books in Portuguese and English of
our dataset which do not always coincide with the correct
language. In the case of the Portuguese test dataset,
Portuguese was the second best language (after Greek),
while in the English dataset the most similar languages
were Greek and Russian, and English was only in sixth place.
Even though the most similar language did not match the
language of the books, the S_λ obtained were significantly
better than chance (p-value = 1.0 × 10^-7 and 4.3 10~^, respectively,
in the English and Portuguese test sets).

The reason why the procedure above was unable to
predict the correct language is directly related to the
use of only one example (a version of the New Testament)
for each language, while in robust classification
methods many examples are used for each class. Hence,
finding the most similar language to the VMS will require
further efforts, with the analysis of as many books as
possible representing each language, which will be a challenge
since there are not many texts widely translated
into many languages.



D. Keywords of the VMS 

One key problem in information sciences is the detection
of important words, as they offer clues about the text
content. In the context of decryption, the identification
of keywords may help guide the deciphering
process, because cryptographers could focus their attention
on the most relevant words. Traditional techniques
are based on the analysis of frequency, such as the widely
used term frequency-inverse document frequency [14] (tf-idf).
Basically, it assigns a high relevance to a word if
it is frequent in the document under analysis but not in
other documents of the collection. The main drawback
of this approach is the requirement of a set
of representative documents in the same language. Obviously,
this restriction makes it impossible to apply tf-idf
to the VMS, since there is only one document written
in this "language". Another possibility would be to use
entropy-based methods [TH] to detect keywords. However,
the application of all these methods to cases such
as the VMS will be limited because they typically require
the manuscript to be arranged in partitions, such
as chapters and sections, which are not easily identified
in the VMS.

To overcome this problem, we use the fact that keywords
show high intermittency inside a single text [5, 19].
Therefore, this feature can play the role traditionally
played by the inverse document frequency (idf). In agreement
with the spirit of the tf-idf analysis, we define the
relevance Ω_i of word i as proportional to both the intermittency
I_i and the frequency N_i as follows:

$$\Omega_i = (I_i - 1)\sqrt{\log N_i}. \qquad (7)$$

Note that with the factor (I_i − 1), words with I ≈ 1 receive
low values of Ω even if they are very frequent. There
are other methods for detecting keywords relying on the
analysis of the uneven distribution of the words [26], but
we decided not to use them because they generate better
results for short texts, which is not the case of the VMS.
For the case of small texts and small frequency, corrections
to our definition of intermittency should be used,
see Ref. which also contains alternative methods
TABLE V: Keywords of the New Testament (English, Portuguese
and German) and the VMS using Eq. (7).

Portuguese        English     German        Voynich
nasceu            begat       zeugete       cthy
Pilatos           Pilates     zentner       qokeedy
ceus              talents     himmelreich   shedy
bem-aventurados   loaves      pilatus       qokain
Isabel            Herod       schwert       chor
anjo              tares       Maria         lkaiin
menino            vineyard    Elisabeth     qol
vinha             shall       Etliches      lchedy
sumo              boat        unkraut       sho
sepulcro          demons      euch          qokaiin
joio              five        schiff        olkeedy
Maria             pay         ihn           qokal
portanto          sabbath     weden         qotain
Herodes           hear        heuchler      dehor
talentos          whosoever   tempel        otedy

for the computation of keywords from intermittency. To
validate this approach, we applied Eq. (7) to the New Testament
in Portuguese, English and German. An inspection
of Tab. V for Portuguese, English and German indicates
that representative words have been captured, such as the
characters "Pilates", "Herod", "Isabel" and "Maria" and
important concepts of the biblical background such as
"nasceu" (was born), "ceus" / "himmelreich" (heavens),
"heuchler" (hypocrite), "demons" and "sabbath". In the
right column of Tab. V we present the list of words obtained
for the VMS through the same procedure, which
are natural candidates for keywords.
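
The keyword score of Eq. (7) can be sketched as follows. The definition of the intermittency is not repeated in this section, so the sketch assumes the common choice of the standard deviation of the recurrence intervals of a word divided by their mean, a natural logarithm, and a hypothetical minimum frequency cutoff; none of these choices should be read as the exact procedure used for Tab. V.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def intermittency(positions):
    """Std/mean of the gaps between successive occurrences (assumed definition)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def keyword_relevance(tokens, min_count=3):
    """Score (I - 1) * sqrt(log N) for every word seen at least min_count times."""
    positions = defaultdict(list)
    for idx, tok in enumerate(tokens):
        positions[tok].append(idx)
    scores = {}
    for word, pos in positions.items():
        if len(pos) >= min_count:
            n = len(pos)
            scores[word] = (intermittency(pos) - 1.0) * math.sqrt(math.log(n))
    return scores


# Toy text: "the" recurs periodically, "herod" is clustered near the start.
tokens = ["the" if i % 2 == 0 else "w%d" % i for i in range(100)]
for i in (1, 3, 5, 7, 99):
    tokens[i] = "herod"
scores = keyword_relevance(tokens)
print(sorted(scores, key=scores.get, reverse=True))  # candidate keywords first
```

In this toy text the evenly spaced word receives a negative score despite its high frequency, while the clustered word receives a positive one, which is the behaviour the factor (I − 1) is meant to produce.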



unknown piece of text, recognized as such by the pres- 
ence of a sequence of symbols organized in "words" , is 
a meaningful text and which language or style is closer 
to it. The framework encompassed statistical analysis 
of individual words and then books using three types of 
measurements, namely metrics obtained from first-order 
statistics, metrics from networks representing text and 
the intermittency properties of words in a text. We identified
a set of measurements capable of distinguishing between
real texts and their shuffled versions, which were
referred to as informative measurements. With further
comparative studies involving the same text (New Tes- 
tament) in 15 languages and distinct books in English 
and Portuguese, we could also find metrics that depend 
on the language (syntax) to a larger extent than on the 
story being told (semantics). Therefore, these measure- 
ments might be employed in language-dependent appli- 
cations. Significantly, the analysis was based entirely on 
statistical properties of words, and did not require any 
knowledge about the meaning of the words or even the 
alphabet in which texts were encoded. 

The use of the framework was exemplified with the 
analysis of the Voynich Manuscript, with the final con- 
clusion that it differs from a random sequence of words, 
being compatible with natural languages. Even though 
our approach is not aimed at deciphering Voynich, it was 
capable of providing keywords that could be helpful for 
decipherers in the future. 